Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task in computer vision. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels and without any further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel framework called CLIP-ES for WSSS. Our framework improves all three stages of WSSS with special designs for CLIP: 1) We introduce the softmax function into GradCAM and exploit the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and backgrounds. Meanwhile, to take full advantage of CLIP, we re-explore text inputs under the WSSS setting and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion. 2) To simplify the stage of CAM refinement, we propose a real-time class-aware attention-based affinity (CAA) module based on the inherent multi-head self-attention (MHSA) in CLIP-ViTs. 3) When training the final segmentation model with the masks generated by CLIP, we introduced a confidence-guided loss (CGL) to mitigate noise and focus on confident regions. Our proposed framework dramatically reduces the cost of training for WSSS and shows the capability of localizing objects in CLIP. Our CLIP-ES achieves SOTA performance on Pascal VOC 2012 and MS COCO 2014 while only taking 10% time of previous methods for the pseudo mask generation. Code is available at https://github.com/linyq2117/CLIP-ES.
translated by 谷歌翻译
文本识别是文档数字化的长期研究问题。现有的方法通常是基于CNN构建的,以用于图像理解,并为Char-Level文本生成而建立RNN。此外,通常需要另一种语言模型来提高整体准确性作为后处理步骤。在本文中,我们提出了一种使用预训练的图像变压器和文本变压器模型(即Trocr)提出的端到端文本识别方法,该模型利用了变压器体系结构,以实现图像理解和文字级级文本生成。TROR模型很简单,但有效,可以通过大规模合成数据进行预训练,并通过人体标记的数据集进行微调。实验表明,TROR模型的表现优于印刷,手写和场景文本识别任务上的当前最新模型。Trocr模型和代码可在\ url {https://aka.ms/trocr}上公开获得。
translated by 谷歌翻译
我们开发了一个新颖的框架,将稀疏集团拉索的正规化者添加到深度学习中的自适应优化者家族中,例如动量,亚当,亚当,阿姆斯格拉德,阿德哈西亚人,并创建了新的优化者,这些优化者被称为群体动量,命名因此,Adagrad小组,亚当集团,Amsgrad集团和Adahessian集团等。我们基于原始偶的方法在随机凸设置中建立理论上证明的收敛保证。我们评估了新优化器对具有最先进的深度学习模型的三个大型现实广告单击数据集的正则效应。实验结果表明,与使用幅度修剪方法的后处理过程相比,模型的性能可以在相同的稀疏度水平上显着提高。此外,与没有幅度修剪的情况相比,我们的方法可以实现极高的稀疏性,并具有明显的更好或高度竞争性的性能。
translated by 谷歌翻译
基于补丁的方法和深度网络已经采用了解决图像染色问题,具有自己的优势和劣势。基于补丁的方法能够通过从未遮盖区域搜索最近的邻居修补程序来恢复具有高质量纹理的缺失区域。但是,这些方法在恢复大缺失区域时会带来问题内容。另一方面,深度网络显示有希望的成果完成大区域。尽管如此,结果往往缺乏类似周围地区的忠诚和尖锐的细节。通过汇集两个范式中,我们提出了一种新的深度染色框架,其中纹理生成是由从未掩蔽区域提取的补丁样本的纹理记忆引导的。该框架具有一种新颖的设计,允许使用深度修复网络训练纹理存储器检索。此外,我们还介绍了贴片分配损失,以鼓励高质量的贴片合成。所提出的方法在三个具有挑战性的图像基准测试中,即地位,Celeba-HQ和巴黎街道视图数据集来说,该方法显示出质量和定量的卓越性能。
translated by 谷歌翻译
Recent works on domain adaptation reveal the effectiveness of adversarial learning on filling the discrepancy between source and target domains. However, two common limitations exist in current adversarial-learning-based methods. First, samples from two domains alone are not sufficient to ensure domain-invariance at most part of latent space. Second, the domain discriminator involved in these methods can only judge real or fake with the guidance of hard label, while it is more reasonable to use soft scores to evaluate the generated images or features, i.e., to fully utilize the inter-domain information. In this paper, we present adversarial domain adaptation with domain mixup (DM-ADA), which guarantees domain-invariance in a more continuous latent space and guides the domain discriminator in judging samples' difference relative to source and target domains. Domain mixup is jointly conducted on pixel and feature level to improve the robustness of models. Extensive experiments prove that the proposed approach can achieve superior performance on tasks with various degrees of domain shift and data complexity.
translated by 谷歌翻译
Traffic flow prediction is an important part of smart transportation. The goal is to predict future traffic conditions based on historical data recorded by sensors and the traffic network. As the city continues to build, parts of the transportation network will be added or modified. How to accurately predict expanding and evolving long-term streaming networks is of great significance. To this end, we propose a new simulation-based criterion that considers teaching autonomous agents to mimic sensor patterns, planning their next visit based on the sensor's profile (e.g., traffic, speed, occupancy). The data recorded by the sensor is most accurate when the agent can perfectly simulate the sensor's activity pattern. We propose to formulate the problem as a continuous reinforcement learning task, where the agent is the next flow value predictor, the action is the next time-series flow value in the sensor, and the environment state is a dynamically fused representation of the sensor and transportation network. Actions taken by the agent change the environment, which in turn forces the agent's mode to update, while the agent further explores changes in the dynamic traffic network, which helps the agent predict its next visit more accurately. Therefore, we develop a strategy in which sensors and traffic networks update each other and incorporate temporal context to quantify state representations evolving over time.
translated by 谷歌翻译
As the basis for prehensile manipulation, it is vital to enable robots to grasp as robustly as humans. In daily manipulation, our grasping system is prompt, accurate, flexible and continuous across spatial and temporal domains. Few existing methods cover all these properties for robot grasping. In this paper, we propose a new methodology for grasp perception to enable robots these abilities. Specifically, we develop a dense supervision strategy with real perception and analytic labels in the spatial-temporal domain. Additional awareness of objects' center-of-mass is incorporated into the learning process to help improve grasping stability. Utilization of grasp correspondence across observations enables dynamic grasp tracking. Our model, AnyGrasp, can generate accurate, full-DoF, dense and temporally-smooth grasp poses efficiently, and works robustly against large depth sensing noise. Embedded with AnyGrasp, we achieve a 93.3% success rate when clearing bins with over 300 unseen objects, which is comparable with human subjects under controlled conditions. Over 900 MPPH is reported on a single-arm system. For dynamic grasping, we demonstrate catching swimming robot fish in the water.
translated by 谷歌翻译
Key performance indicators(KPIs) are of great significance in the monitoring of wireless network service quality. The network service quality can be improved by adjusting relevant configuration parameters(CPs) of the base station. However, there are numerous CPs and different cells may affect each other, which bring great challenges to the association analysis of wireless network data. In this paper, we propose an adjustable multi-level association rule mining framework, which can quantitatively mine association rules at each level with environmental information, including engineering parameters and performance management(PMs), and it has interpretability at each level. Specifically, We first cluster similar cells, then quantify KPIs and CPs, and integrate expert knowledge into the association rule mining model, which improve the robustness of the model. The experimental results in real world dataset prove the effectiveness of our method.
translated by 谷歌翻译
Previous work on action representation learning focused on global representations for short video clips. In contrast, many practical applications, such as video alignment, strongly demand learning the intensive representation of long videos. In this paper, we introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner, especially for long videos. Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context by combining convolution and transformer. Inspired by the recent massive progress in self-supervised learning, we propose a new sequence contrast loss (SCL) applied to two related views obtained by expanding a series of spatio-temporal data in two versions. One is the self-supervised version that optimizes embedding space by minimizing KL-divergence between sequence similarity of two augmented views and prior Gaussian distribution of timestamp distance. The other is the weakly-supervised version that builds more sample pairs among videos using video-level labels by dynamic time wrapping (DTW). Experiments on FineGym, PennAction, and Pouring datasets show that our method outperforms previous state-of-the-art by a large margin for downstream fine-grained action classification and even faster inference. Surprisingly, although without training on paired videos like in previous works, our self-supervised version also shows outstanding performance in video alignment and fine-grained frame retrieval tasks.
translated by 谷歌翻译
对比性语言图像预测在学习网络尺度数据的视觉文本联合表示方面取得了巨大的成功,这表明了各种图像任务的显着“零射”概括能力。但是,如何有效地将这种新的语言图像预处理方法扩展到视频域仍然是一个开放的问题。在这项工作中,我们提出了一种简单而有效的方法,该方法将预验证的语言图像模型直接适应视频识别,而不是从头开始预处理新模型。更具体地说,为了捕获沿时间维度框架的远距离依赖性,我们提出了一种跨框架注意机制,该机制明确地跨帧交换信息。这样的模块是轻量级的,可以无缝地插入验证的语言图像模型中。此外,我们提出了一个特定于视频的提示方案,该方案利用视频内容信息生成歧视性文本提示。广泛的实验表明,我们的方法是有效的,可以推广到不同的视频识别方案。特别是,在完全监督的设置下,我们的方法在Kinectics-400上获得了最高1的精度为87.1%,而与SWIN-L和Vivit-H相比,使用量少12倍。在零拍摄的实验中,我们的方法超过了当前的最新方法 +7.6%和 +14.9%,而在两个流行协议下,TOP-1的准确性。在少数拍摄的情况下,当标记的数据非常有限时,我们的方法优于先前的最佳方法 +32.1%和 +23.1%。代码和型号可在https://aka.ms/x-clip上找到
translated by 谷歌翻译